RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after colliding with an iceberg during its maiden voyage from Southampton, UK, to New York City. There were an estimated 2224 passengers and crew aboard the ship, and more than 1500 died, making it one of the deadliest peacetime commercial maritime disasters in modern history. Below is the journey map of Titanic:
Fig 1: Titanic Journey Map
There are 891 observations and 12 variables in this dataset. It covers only about 40% of the overall 2224 passengers and crew on board Titanic.
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Fig 2: Variables Description
Above are the descriptions of these variables, together with some extra explanations.
The missing values in each variable are shown next:
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
The variables ‘Age’ and ‘Cabin’ are missing a substantial number of values. ‘Embarked’ has 2 missing values.
Summary of the data:
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 C :168
## 1st Qu.: 7.91 Class :character Q : 77
## Median : 14.45 Mode :character S :644
## Mean : 32.20 NA's: 2
## 3rd Qu.: 31.00
## Max. :512.33
##
For numerical variables such as Age, SibSp, Parch, and Fare, histograms will be used to visualise their distributions.
The variables ‘Age’, ‘Cabin’, and ‘Embarked’ have missing values. ‘Cabin’ will be dropped from the data since 687 of its values are missing. ‘Age’ has 177 missing values, which we will deal with later. In this section, ‘Embarked’, which has only 2 missing values, will be imputed. The observations with a missing ‘Embarked’ value are shown below:
## PassengerId Survived Pclass Name
## 62 62 1 1 Icard, Miss. Amelie
## 830 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 62 female 38 0 0 113572 80 B28 <NA>
## 830 female 62 0 0 113572 80 B28 <NA>
We can see that both passengers are female, traveled on a 1st class ticket, and both paid a fare of 80. The fare might indicate which port they embarked from.
| Embarked | Pclass | MedianFare |
|---|---|---|
| C | 1 | 78.2667 |
| C | 2 | 24.0000 |
| C | 3 | 7.8958 |
| Q | 1 | 90.0000 |
| Q | 2 | 12.3500 |
| Q | 3 | 7.7500 |
| S | 1 | 52.0000 |
| S | 2 | 13.5000 |
| S | 3 | 8.0500 |
The median fare is used since the distribution of fares is irregular. A fare of 80 is very close to the median fare of 1st class tickets embarked from Cherbourg (78.27). We will impute the 2 missing values with Cherbourg.
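A minimal sketch of this imputation, assuming the data frame is named `df` as in the later snippets:

```r
# Replace the two missing Embarked values with "C" (Cherbourg),
# which is already a level of the factor
df$Embarked[is.na(df$Embarked)] <- "C"
```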
## PassengerId Survived Pclass Name
## 62 62 1 1 Icard, Miss. Amelie
## 830 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 62 female 38 0 0 113572 80 B28 C
## 830 female 62 0 0 113572 80 B28 C
The missing values in ‘Age’ will be dealt with in a later section.
Feature engineering is an approach that creates additional relevant features (variables) from the existing raw data so that more insights can be exploited and the predictive power of the models increased.
SibSp and Parch indicate how many siblings/spouses and parents/children a passenger traveled aboard with. We want to combine both variables into a new feature, FamSize, indicating the total number of family members aboard the Titanic, including the passenger.
df$FamSize <- df$SibSp + df$Parch + 1
With that feature, we can also create another feature indicating whether the passenger traveled alone, called IsAlone, where IsAlone = 1 if FamSize = 1 and IsAlone = 0 if FamSize > 1.
df$IsAlone <- if_else(df$FamSize > 1, 0, 1)
df$IsAlone <- as.factor(df$IsAlone)
Fare is a continuous variable, so we want to classify it into a few bins (groups). The qcut function is used so that Fare is grouped by its quartile ranges.
df$FareBin <- qcut(df$Fare, cuts = 4)
df$FareBin <- as.factor(df$FareBin)
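Note that `qcut` is not a base R function, so it is presumably a helper defined elsewhere; an equivalent quartile binning can be sketched in base R with `cut` and `quantile`:

```r
# Split Fare at its quartiles into 4 roughly equal-sized bins
breaks <- quantile(df$Fare, probs = seq(0, 1, 0.25), na.rm = TRUE)
df$FareBin <- cut(df$Fare, breaks = unique(breaks), include.lowest = TRUE)
```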
There are 177 missing values in the Age variable. Simply replacing them with the mean/median age might not be the best solution because age may differ across categories of passengers. The social title might provide clues for imputing the missing ages.
| Title | MedianAge |
|---|---|
| Master | 3.5 |
| Miss | 21.0 |
| Mr | 30.0 |
| Mrs | 35.0 |
| Noble | 48.5 |
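The medians in the table can be reproduced with a one-line aggregation (a sketch, assuming `df` already carries the derived `Title` column):

```r
# Median observed age per social title; the formula interface drops NA ages
aggregate(Age ~ Title, data = df, FUN = median)
```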
From the boxplot above, the distribution of age clearly differs between titles. The median age of each title will be imputed into the missing values according to the passengers’ titles.
df$Age[df$Title == "Master" & is.na(df$Age)] <- 3.5
df$Age[df$Title == "Miss" & is.na(df$Age)] <- 21
df$Age[df$Title == "Mr" & is.na(df$Age)] <- 30
df$Age[df$Title == "Mrs" & is.na(df$Age)] <- 35
df$Age[df$Title == "Noble" & is.na(df$Age)] <- 48.5
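The five assignments above can also be collapsed into a single lookup; the named vector mirrors the medians in the table:

```r
# Map each title to its median age, then fill only the missing entries
med_age <- c(Master = 3.5, Miss = 21, Mr = 30, Mrs = 35, Noble = 48.5)
missing_age <- is.na(df$Age)
df$Age[missing_age] <- med_age[as.character(df$Title[missing_age])]
```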
Now that the missing ages are imputed, we want to look at the age distribution by survival status to glean any insights.
Passengers younger than about 16 had a high survival rate, which is useful information. In the age band between 24 and 80, survival was unfavourable, but this is not very informative since most passengers did not survive to begin with. We will therefore create a new feature, AgeGroup, classifying passengers into those aged 16 and below and those above 16.
df$AgeGroup <- if_else(df$Age <= 16, "<=16", ">16")
df$AgeGroup <- as.factor(df$AgeGroup)
With data processing and cleaning done, we now want to explore the data more deeply with the help of visualisation tools.
Note: Go through each tab for the different variables; some of the plots are interactive, so hover the mouse cursor over them to view the data presented.
74.2% of females survived the disaster, far above the mean survival rate of 38.4%, whereas only 18.9% of males survived, far below the mean. Females clearly survived at a higher rate than males.
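These rates can be checked directly (a sketch; `Survived` is the two-level factor shown in the data summary):

```r
# Survival proportion within each sex
surv <- as.numeric(as.character(df$Survived))  # factor "0"/"1" -> 0/1
round(tapply(surv, df$Sex, mean), 3)
```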
63% of passengers traveling with a 1st class ticket survived, 47.3% with a 2nd class ticket, and only 24.2% with a 3rd class ticket, compared with the mean survival rate of 38.4%.
Children and teens aged 16 and below survived at 54.8%. Passengers above 16 survived at 36.2%, which is close to the mean survival rate.
In this section, we will look at SibSp, Parch, and FamSize.
There is no clear takeaway from how family size could affect the survival rate.
Passengers who were traveling alone had a lower survival rate (30.4%) than those traveling with family members (50.6%).
Those who embarked at Cherbourg had the highest survival rate (55.9%), followed by those who embarked at Queenstown (39%) and Southampton (33.7%).
The survival rate increases as the fare gets higher. We will explore this by looking at the breakdown of fare groups (FareBin).
Those who paid a fare of 31 and above survived at the highest rate (58.1%), followed by those who paid 14.5 - 31 (45.5%). Those who paid below 7.91 survived at a rate of 19.7%.
The survival rates of Miss (70.3%) and Mrs (79.4%) are the highest while that of Mr (15.7%) is the lowest, in line with the finding that females had a much higher survival rate than males. The survival rate of Master (57.5%) is also in line with the finding that children survived at a higher rate. Passengers with a Noble title have a survival rate of 34.8%.
Note: Go through each tab for the different variables; some of the plots are interactive, so hover the mouse cursor over them to view the data presented.
Female passengers with 1st class tickets almost all survived (96.8%); the high survival rate also holds for female passengers with 2nd class tickets (92.1%), whereas female passengers with 3rd class tickets had a survival rate of 50%.
The same pattern holds for male passengers, where 1st class survival rate > 2nd class survival rate > 3rd class survival rate, but at much lower rates than for female passengers.
Those aged 16 and below traveling with 1st and 2nd class tickets had very high survival rates (88.9% and 90.5% respectively). Those above 16 traveling with a 1st class ticket also survived at a high rate (61.8%), which further indicates that the survival odds of 1st class passengers were high.
Interestingly, among the females, those aged above 16 survived at a higher rate than those aged 16 and below, whereas among the males, those aged 16 and below survived at a much higher rate than those above 16.
Interestingly, for 1st class and 2nd class tickets, the median fare of those who survived is higher than of those who did not.
Masters traveling in 1st and 2nd class all survived (100%) whereas Masters in 3rd class had a survival rate of 39.3%. As established before, Miss and Mrs traveling in 3rd class had a survival rate of 50%. Nobles traveled only in 1st and 2nd class; those in 1st class had a survival rate of 53.3% whereas those in 2nd class did not survive. Mr had the lowest survival rate amongst all the Title groups, although the odds of survival were slightly higher for those traveling in 1st class.
Passengers traveling alone who embarked from Queenstown had a higher survival rate (40.35%) than those who embarked from Queenstown with family members (35%).
Interestingly, female passengers traveling alone with a 3rd class ticket actually had a high survival rate (61.67%). Generally, male passengers traveling alone had slightly lower survival rates in 1st and 3rd class.
The following flow diagram is called an Alluvial diagram and is plotted using the alluvial package. The maroon-coloured bands represent those who survived, whereas the grey-coloured bands represent those who did not. The wider the band, the larger the number of passengers it represents.
The Alluvial diagram summarises what we found for Sex and Pclass, AgeGroup and Pclass, and Sex and AgeGroup:
The following flow diagram is also an Alluvial diagram plotted using the alluvial package. The violet-coloured bands represent those who survived, whereas the grey-coloured bands represent those who did not. The wider the band, the larger the number of passengers it represents.
Male passengers aged 16 and below did not actually travel alone. Female passengers of any age and any ticket class had higher survival odds even when traveling alone.
With data processing, cleaning, and exploration done, the next step is to predict survival in the Titanic disaster. Go through each tab for the different machine learning (ML) models. 5-fold cross-validation is used in each model to help fine-tune hyperparameters and to assess model performance. Cross-validation accuracy is also a good proxy for the test accuracy of a model. The Logistic Regression model serves as the baseline model.
The predictors selected are Pclass, Sex, Embarked, Title, and IsAlone. AgeGroup is not selected because Age had many missing values; despite the ages being imputed reasonably, we do not want to introduce unnecessary noise into the model. Besides, Title contains enough relevant information about the age groups. FareBin is also not selected because Pclass already carries much of its information, as suggested by the correlation heatmap.
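The baseline fit can be sketched with the caret package along these lines; the formula and resampling settings are inferred from the output that follows, so treat this as an assumption rather than the exact original call:

```r
library(caret)

set.seed(1)  # reproducible fold assignment
# Logistic regression (binomial GLM) on the five chosen predictors,
# evaluated with 5-fold cross-validation
glm_fit <- train(Survived ~ Pclass + Sex + Embarked + Title + IsAlone,
                 data = df,
                 method = "glm", family = "binomial",
                 trControl = trainControl(method = "cv", number = 5))
```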
## Generalized Linear Model
##
## 891 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 712, 713, 713, 713, 713
## Resampling results:
##
## Accuracy Kappa
## 0.8102756 0.5956287
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5298 -0.6559 -0.3648 0.6235 2.5448
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.0663 509.3457 0.035 0.971705
## Pclass2 -0.9229 0.2731 -3.380 0.000726 ***
## Pclass3 -2.1721 0.2478 -8.766 < 2e-16 ***
## Sexmale -15.4123 509.3455 -0.030 0.975860
## EmbarkedQ -0.2667 0.3815 -0.699 0.484559
## EmbarkedS -0.6963 0.2380 -2.926 0.003433 **
## TitleMiss -15.4295 509.3457 -0.030 0.975834
## TitleMr -2.9836 0.4160 -7.172 7.37e-13 ***
## TitleMrs -14.9933 509.3457 -0.029 0.976516
## TitleNoble -3.3816 0.6860 -4.929 8.25e-07 ***
## IsAlone1 0.5216 0.2211 2.359 0.018315 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 763.77 on 880 degrees of freedom
## AIC: 785.77
##
## Number of Fisher Scoring iterations: 13
The Logistic Regression model has a cross-validation accuracy of 81%, as summarised in the estimation output above.
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1
## 0 53.2 10.5
## 1 8.4 27.8
##
## Accuracy (average) : 0.8103
On average, of the 549 passengers who did not survive, Logistic Regression model can correctly classify 474 of them (53.2% of 891). Whereas, of the 342 passengers who survived, this model can correctly classify 247 of them (27.8% of 891).
## Naive Bayes
##
## 891 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 712, 713, 713, 713, 713
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.7901011 0.5502373
## TRUE 0.7800201 0.5144534
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
## and adjust = 1.
The optimised Naive Bayes classifier is the one without a kernel density estimate (usekernel = FALSE). The cross-validation accuracy of the tuned Naive Bayes model is 79%.
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1
## 0 52.5 11.9
## 1 9.1 26.5
##
## Accuracy (average) : 0.7901
On average, of the 549 passengers who did not survive, Naive Bayes model can correctly classify 467 of them (52.5% of 891). Whereas, of the 342 passengers who survived, this model can correctly classify 236 of them (26.5% of 891).
## CART
##
## 891 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 712, 713, 713, 713, 713
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.02046784 0.8159689 0.5913317
## 0.04093567 0.7901324 0.5488081
## 0.43274854 0.7092461 0.3148347
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02046784.
The optimal Decision Tree model is the one with the smallest complexity parameter tried, cp of about 0.0205, i.e. the least pruned of the three candidates. The cross-validation accuracy of the tuned Decision Tree model is about 81.6%.
Title being Mr. is the most important feature for the model as it appears at the top internal node. A passenger with the title Mr. is predicted not to survive. For passengers who are not a Mr., the next most important feature is Pclass: those not traveling in 3rd class are predicted to survive, while those in 3rd class who embarked from Southampton are predicted not to survive.
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1
## 0 57.4 14.1
## 1 4.3 24.2
##
## Accuracy (average) : 0.8159
On average, of the 549 passengers who did not survive, Decision Tree model can correctly classify 511 of them (57.4% of 891). Whereas, of the 342 passengers who survived, this model can correctly classify 215 of them (24.2% of 891).
## Random Forest
##
## 891 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 712, 713, 713, 712, 714
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8137318 0.5905738
## 6 0.8260726 0.6108300
## 10 0.8215845 0.6029080
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
The optimised Random Forest model uses mtry = 6. This means that at each split during tree building, 6 of the 10 (dummy-encoded) variables are randomly sampled as split candidates. The cross-validation accuracy of the tuned Random Forest model is about 82.6%.
For the Random Forest model, the top 3 most important features are TitleMr, Sex, and Pclass3. Being male, and particularly an adult male (Mr), is a very important characteristic in predicting whether a passenger survives, as is holding a 3rd class ticket. This accords with the findings in the EDA, where Mr and males had a very low survival rate, as did 3rd class passengers.
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1
## 0 58.4 14.1
## 1 3.3 24.2
##
## Accuracy (average) : 0.826
On average, of the 549 passengers who did not survive, Random Forest model can correctly classify 520 of them (58.4% of 891). Whereas, of the 342 passengers who survived, this model can correctly classify 215 of them (24.2% of 891).
## k-Nearest Neighbors
##
## 891 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 712, 713, 713, 712, 714
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8170774 0.5900708
## 7 0.8058729 0.5643255
## 9 0.8069964 0.5643497
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
The optimised k-Nearest Neighbours model is the 5-nearest-neighbour model, with a cross-validation accuracy of 81.7%.
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1
## 0 58.1 14.8
## 1 3.5 23.6
##
## Accuracy (average) : 0.8171
On average, of the 549 passengers who did not survive, 5-NN model can correctly classify 517 of them (58.1% of 891). Whereas, of the 342 passengers who survived, this model can correctly classify 210 of them (23.6% of 891).
## Support Vector Machines with Linear Kernel
##
## 891 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 712, 713, 713, 713, 713
## Resampling results across tuning parameters:
##
## cost Accuracy Kappa
## 0.25 0.7923483 0.5639831
## 0.50 0.7934718 0.5665311
## 1.00 0.7934718 0.5665311
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cost = 0.5.
The optimised cost parameter for the Support Vector Machine is 0.5. The cross-validation accuracy of the tuned SVM with linear kernel is 79.3%.
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1
## 0 50.6 9.7
## 1 11.0 28.7
##
## Accuracy (average) : 0.7935
On average, of the 549 passengers who did not survive, SVM with Linear Kernel model can correctly classify 450 of them (50.6% of 891). Whereas, of the 342 passengers who survived, this model can correctly classify 255 of them (28.7% of 891).
## Multi-Layer Perceptron
##
## 891 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 712, 713, 713, 713, 713
## Resampling results across tuning parameters:
##
## size Accuracy Kappa
## 1 0.8192581 0.5949875
## 3 0.8012994 0.5667443
## 5 0.8103007 0.5793093
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was size = 1.
## SNNS network definition file V1.4-3D
## generated at Fri May 31 23:42:54 2019
##
## network name : RSNNS_untitled
## source files :
## no. of units : 13
## no. of connections : 12
## no. of unit types : 0
## no. of site types : 0
##
##
## learning function : Std_Backpropagation
## update function : Topological_Order
##
##
## unit default section :
##
## act | bias | st | subnet | layer | act func | out func
## ---------|----------|----|--------|-------|--------------|-------------
## 0.00000 | 0.00000 | i | 0 | 1 | Act_Logistic | Out_Identity
## ---------|----------|----|--------|-------|--------------|-------------
##
##
## unit definition section :
##
## no. | typeName | unitName | act | bias | st | position | act func | out func | sites
## ----|----------|------------------|----------|----------|----|----------|--------------|----------|-------
## 1 | | Input_Pclass2 | 0.00000 | 0.09533 | i | 1, 0, 0 | Act_Identity | |
## 2 | | Input_Pclass3 | 1.00000 | -0.28010 | i | 2, 0, 0 | Act_Identity | |
## 3 | | Input_Sexmale | 1.00000 | -0.00406 | i | 3, 0, 0 | Act_Identity | |
## 4 | | Input_EmbarkedQ | 1.00000 | 0.07664 | i | 4, 0, 0 | Act_Identity | |
## 5 | | Input_EmbarkedS | 0.00000 | 0.16939 | i | 5, 0, 0 | Act_Identity | |
## 6 | | Input_TitleMiss | 0.00000 | -0.05483 | i | 6, 0, 0 | Act_Identity | |
## 7 | | Input_TitleMr | 1.00000 | -0.08896 | i | 7, 0, 0 | Act_Identity | |
## 8 | | Input_TitleMrs | 0.00000 | -0.04093 | i | 8, 0, 0 | Act_Identity | |
## 9 | | Input_TitleNoble | 0.00000 | -0.01046 | i | 9, 0, 0 | Act_Identity | |
## 10 | | Input_IsAlone1 | 1.00000 | -0.12019 | i | 10, 0, 0 | Act_Identity | |
## 11 | | Hidden_2_1 | 0.99751 | -6.01455 | h | 1, 2, 0 |||
## 12 | | Output_0 | 0.88963 | -2.79980 | o | 1, 4, 0 |||
## 13 | | Output_1 | 0.11037 | 2.79980 | o | 2, 4, 0 |||
## ----|----------|------------------|----------|----------|----|----------|--------------|----------|-------
##
##
## connection definition section :
##
## target | site | source:weight
## -------|------|---------------------------------------------------------------------------------------------------------------------
## 11 | | 10:-0.43498, 9: 4.91658, 8: 1.05091, 7: 5.73250, 6: 1.49778, 5: 1.14598, 4: 0.27498, 3: 1.64406, 2: 4.79153,
## 1: 1.81651
## 12 | | 11: 4.89891
## 13 | | 11:-4.89891
## -------|------|---------------------------------------------------------------------------------------------------------------------
The optimised hidden-layer size of the Multi-Layer Perceptron is 1. The cross-validation accuracy of the tuned Perceptron model is about 81.9%.
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1
## 0 58.4 14.8
## 1 3.3 23.6
##
## Accuracy (average) : 0.8193
On average, of the 549 passengers who did not survive, Multi-Layer Perceptron model can correctly classify 520 of them (58.4% of 891). Whereas, of the 342 passengers who survived, this model can correctly classify 210 of them (23.6% of 891).
## eXtreme Gradient Boosting
##
## 891 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 713, 713, 713, 712, 713
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds Accuracy
## 0.3 1 0.6 0.50 50 0.7879104
## 0.3 1 0.6 0.50 100 0.7890340
## 0.3 1 0.6 0.50 150 0.8013433
## 0.3 1 0.6 0.75 50 0.7935158
## 0.3 1 0.6 0.75 100 0.8024543
## 0.3 1 0.6 0.75 150 0.8069613
## 0.3 1 0.6 1.00 50 0.7946457
## 0.3 1 0.6 1.00 100 0.7935158
## 0.3 1 0.6 1.00 150 0.8013370
## 0.3 1 0.8 0.50 50 0.7980102
## 0.3 1 0.8 0.50 100 0.8114494
## 0.3 1 0.8 0.50 150 0.8002071
## 0.3 1 0.8 0.75 50 0.7991338
## 0.3 1 0.8 0.75 100 0.7968866
## 0.3 1 0.8 0.75 150 0.8024543
## 0.3 1 0.8 1.00 50 0.7946457
## 0.3 1 0.8 1.00 100 0.7935158
## 0.3 1 0.8 1.00 150 0.7946394
## 0.3 2 0.6 0.50 50 0.8226853
## 0.3 2 0.6 0.50 100 0.8294206
## 0.3 2 0.6 0.50 150 0.8260498
## 0.3 2 0.6 0.75 50 0.8249388
## 0.3 2 0.6 0.75 100 0.8260498
## 0.3 2 0.6 0.75 150 0.8260498
## 0.3 2 0.6 1.00 50 0.8260561
## 0.3 2 0.6 1.00 100 0.8283033
## 0.3 2 0.6 1.00 150 0.8316678
## 0.3 2 0.8 0.50 50 0.8294206
## 0.3 2 0.8 0.50 100 0.8282970
## 0.3 2 0.8 0.50 150 0.8226853
## 0.3 2 0.8 0.75 50 0.8283033
## 0.3 2 0.8 0.75 100 0.8282970
## 0.3 2 0.8 0.75 150 0.8282970
## 0.3 2 0.8 1.00 50 0.8260561
## 0.3 2 0.8 1.00 100 0.8282970
## 0.3 2 0.8 1.00 150 0.8316678
## 0.3 3 0.6 0.50 50 0.8305505
## 0.3 3 0.6 0.50 100 0.8260624
## 0.3 3 0.6 0.50 150 0.8283096
## 0.3 3 0.6 0.75 50 0.8316678
## 0.3 3 0.6 0.75 100 0.8238089
## 0.3 3 0.6 0.75 150 0.8305505
## 0.3 3 0.6 1.00 50 0.8305505
## 0.3 3 0.6 1.00 100 0.8294143
## 0.3 3 0.6 1.00 150 0.8294143
## 0.3 3 0.8 0.50 50 0.8204507
## 0.3 3 0.8 0.50 100 0.8226791
## 0.3 3 0.8 0.50 150 0.8260498
## 0.3 3 0.8 0.75 50 0.8260498
## 0.3 3 0.8 0.75 100 0.8282970
## 0.3 3 0.8 0.75 150 0.8282970
## 0.3 3 0.8 1.00 50 0.8271797
## 0.3 3 0.8 1.00 100 0.8282970
## 0.3 3 0.8 1.00 150 0.8282970
## 0.4 1 0.6 0.50 50 0.7845396
## 0.4 1 0.6 0.50 100 0.7968489
## 0.4 1 0.6 0.50 150 0.8081163
## 0.4 1 0.6 0.75 50 0.7968489
## 0.4 1 0.6 0.75 100 0.8058377
## 0.4 1 0.6 0.75 150 0.8058377
## 0.4 1 0.6 1.00 50 0.7923985
## 0.4 1 0.6 1.00 100 0.8035779
## 0.4 1 0.6 1.00 150 0.8080723
## 0.4 1 0.8 0.50 50 0.7979725
## 0.4 1 0.8 0.50 100 0.8091959
## 0.4 1 0.8 0.50 150 0.8024669
## 0.4 1 0.8 0.75 50 0.7946394
## 0.4 1 0.8 0.75 100 0.8024669
## 0.4 1 0.8 0.75 150 0.8069613
## 0.4 1 0.8 1.00 50 0.7923985
## 0.4 1 0.8 1.00 100 0.8058314
## 0.4 1 0.8 1.00 150 0.8080723
## 0.4 2 0.6 0.50 50 0.8271734
## 0.4 2 0.6 0.50 100 0.8170674
## 0.4 2 0.6 0.50 150 0.8260436
## 0.4 2 0.6 0.75 50 0.8193271
## 0.4 2 0.6 0.75 100 0.8283159
## 0.4 2 0.6 0.75 150 0.8204507
## 0.4 2 0.6 1.00 50 0.8283033
## 0.4 2 0.6 1.00 100 0.8305505
## 0.4 2 0.6 1.00 150 0.8282970
## 0.4 2 0.8 0.50 50 0.8226979
## 0.4 2 0.8 0.50 100 0.8249200
## 0.4 2 0.8 0.50 150 0.8193271
## 0.4 2 0.8 0.75 50 0.8159563
## 0.4 2 0.8 0.75 100 0.8215743
## 0.4 2 0.8 0.75 150 0.8238089
## 0.4 2 0.8 1.00 50 0.8305505
## 0.4 2 0.8 1.00 100 0.8305505
## 0.4 2 0.8 1.00 150 0.8327851
## 0.4 3 0.6 0.50 50 0.8226791
## 0.4 3 0.6 0.50 100 0.8170674
## 0.4 3 0.6 0.50 150 0.8204381
## 0.4 3 0.6 0.75 50 0.8282970
## 0.4 3 0.6 0.75 100 0.8282970
## 0.4 3 0.6 0.75 150 0.8249325
## 0.4 3 0.6 1.00 50 0.8271797
## 0.4 3 0.6 1.00 100 0.8282970
## 0.4 3 0.6 1.00 150 0.8294143
## 0.4 3 0.8 0.50 50 0.8181972
## 0.4 3 0.8 0.50 100 0.8238026
## 0.4 3 0.8 0.50 150 0.8215617
## 0.4 3 0.8 0.75 50 0.8238089
## 0.4 3 0.8 0.75 100 0.8215617
## 0.4 3 0.8 0.75 150 0.8193145
## 0.4 3 0.8 1.00 50 0.8271797
## 0.4 3 0.8 1.00 100 0.8271797
## 0.4 3 0.8 1.00 150 0.8260498
## Kappa
## 0.5473285
## 0.5515389
## 0.5771336
## 0.5595562
## 0.5794570
## 0.5882261
## 0.5618207
## 0.5595562
## 0.5747656
## 0.5676582
## 0.5952334
## 0.5737954
## 0.5703690
## 0.5670234
## 0.5781611
## 0.5618207
## 0.5595562
## 0.5622150
## 0.6058926
## 0.6175551
## 0.6136591
## 0.6106364
## 0.6113221
## 0.6111930
## 0.6107888
## 0.6161225
## 0.6236951
## 0.6189170
## 0.6170220
## 0.6063361
## 0.6157929
## 0.6164965
## 0.6161002
## 0.6108548
## 0.6152356
## 0.6232144
## 0.6226650
## 0.6155076
## 0.6208394
## 0.6233325
## 0.6085824
## 0.6210634
## 0.6205710
## 0.6183942
## 0.6188631
## 0.5997443
## 0.6093756
## 0.6142927
## 0.6112374
## 0.6165809
## 0.6169772
## 0.6147082
## 0.6169772
## 0.6169772
## 0.5418517
## 0.5676398
## 0.5892909
## 0.5665360
## 0.5841003
## 0.5860053
## 0.5564300
## 0.5800226
## 0.5879721
## 0.5677109
## 0.5925673
## 0.5794320
## 0.5629599
## 0.5786504
## 0.5878756
## 0.5574397
## 0.5827756
## 0.5890129
## 0.6181394
## 0.5984092
## 0.6130238
## 0.5970644
## 0.6160487
## 0.6031717
## 0.6161565
## 0.6214261
## 0.6165809
## 0.6054100
## 0.6103058
## 0.6019137
## 0.5924211
## 0.6040831
## 0.6102085
## 0.6214261
## 0.6214261
## 0.6259773
## 0.6046745
## 0.5983520
## 0.6028083
## 0.6161002
## 0.6169772
## 0.6094491
## 0.6147082
## 0.6164965
## 0.6192594
## 0.5962309
## 0.6092659
## 0.6037124
## 0.6085824
## 0.6036752
## 0.5984161
## 0.6147082
## 0.6147082
## 0.6117182
##
## Tuning parameter 'gamma' was held constant at a value of 0
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 2,
## eta = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1
## and subsample = 1.
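The hyperparameter grid searched above can be reconstructed as follows (a sketch; the values are read off the printed results, and the grid would be passed to caret's train as tuneGrid):

```r
# 3 nrounds x 3 depths x 2 etas x 2 colsample x 3 subsample = 108 settings,
# matching the rows of the tuning table above
xgb_grid <- expand.grid(nrounds          = c(50, 100, 150),
                        max_depth        = 1:3,
                        eta              = c(0.3, 0.4),
                        gamma            = 0,
                        colsample_bytree = c(0.6, 0.8),
                        min_child_weight = 1,
                        subsample        = c(0.5, 0.75, 1))
# e.g. train(..., method = "xgbTree", tuneGrid = xgb_grid)
```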
The cross-validation accuracy of tuned eXtreme Gradient Boosting (XGBoost) Tree model is about 83%.
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1
## 0 58.6 13.7
## 1 3.0 24.7
##
## Accuracy (average) : 0.8328
On average, of the 549 passengers who did not survive, XGBoostTree model can correctly classify 522 of them (58.6% of 891). Whereas, of the 342 passengers who survived, this model can correctly classify 220 of them (24.7% of 891).
Two features are consistently given high importance across the models: TitleMr and Pclass3. These are probably the most important predictors. As pointed out in the EDA, Mr had the lowest survival rate (15.7%) amongst all the Title groups, and 3rd class passengers had the lowest survival rate (24.2%) amongst all the ticket classes. Hence, it is no surprise that passengers falling into these categories had very low odds of survival.
Fig 3: Models Comparison
XGBoost, Random Forest, Perceptron, k-NN, and Decision Tree models perform better than the Logistic Regression model, probably because they can capture non-linear patterns that Logistic Regression (without non-linear extensions) cannot. Of the 2224 people aboard Titanic, 891 observations are used in the analyses above. The cross-validation accuracy provides a realistic estimate of test accuracy, i.e. how well the chosen model performs on previously unseen data (the remaining 1333 observations). However, the true test accuracy and how well the model generalises can only be assessed by exposing it to other unseen observations.
Lastly, three features are left unexploited here due to complexity. One is the family name of the passengers: family names might help explore the relationships amongst passengers in detail, and might even indicate whether they all stayed in the same cabin and whether the younger family members survived the most. The other unexploited features are Cabin and ticket code. Ticket codes might provide deeper insight into Pclass and are probably associated with cabins. Cabins might provide a strong clue as to which were closer to the upper deck, and hence whether their occupants reached the upper deck faster and got to escape the ship sooner. Exploiting these features might further increase the predictive power of the models.
2.0.1 Social Titles of Passengers
Notice that the Name variable contains not only the first, middle, and last names of the passengers but also their social titles (Mr., Mrs., Miss., Col., etc.). These titles relate to the passengers’ gender, age, and marital status, and include military ranks, clergy, Royals and other Nobles. They might help represent the social status of the passengers and predict whether a person of Noble status is more likely to survive. Below are all the titles that can be extracted from this dataset:
There are various social titles and we want to group them into a few representative categories. First, Master is a title used to address young boys who are not old enough to be addressed as Mister (Mr.), and since there are 40 of them, Master remains a group of its own. Mr is also a group of its own. Next, it is hard to classify equivalent young ladies since there is no general title for them, so we classify the women as married or unmarried: Miss, Ms, and Mlle (Mademoiselle in French) are classified as Miss (unmarried women), whereas Mrs and Mme (Madame in French) are classified as Mrs (married women). Lastly, the military, clergy, Royal, and other Noble titles are grouped together as Noble.
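The extraction and grouping just described can be sketched with a regular expression; this assumes, as in the printed names, that the title always sits between the comma and a period:

```r
# "Braund, Mr. Owen Harris" -> "Mr"
df$Title <- sub("^.*, *([A-Za-z ]+)\\..*$", "\\1", df$Name)

# Collapse variants into the five representative groups
df$Title[df$Title %in% c("Ms", "Mlle")] <- "Miss"
df$Title[df$Title == "Mme"] <- "Mrs"
df$Title[!(df$Title %in% c("Master", "Miss", "Mr", "Mrs"))] <- "Noble"
df$Title <- as.factor(df$Title)
```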